Functions and Character Manipulation

1 Functions

When you create a function, it defines a separate environment, and the variables you create inside your function only exist in that function environment; when you return to where you called the function from, those variables no longer exist. You can refer to other objects that are in the calling environment, but if you make any changes to them, the changes will only take place in the function environment. To get information back to the calling environment, you must pass a return value, which will be available as the value of the call to the function. R will automatically return the last unassigned value it encounters in your function, or you can place the object you want to return in a call to the return function. You can only return a single object from a function in R; if you need to return multiple objects, you need to return a list containing those objects, and extract them from the list when you return to the calling environment.
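As a minimal sketch of these rules (the function name rangeinfo and the vector v are made up for illustration), the following shows that changes made inside a function stay local, and that several results can be handed back in a single list:

```r
# rangeinfo is a hypothetical helper, not part of R
rangeinfo = function(x){
   x = sort(x)                               # changes only the function's local copy of x
   list(smallest=x[1],largest=x[length(x)])  # the last value is the return value
}

v = c(3,1,2)
res = rangeinfo(v)
res$smallest     # 1
res$largest      # 3
v                # still 3 1 2; the caller's vector was not sorted
```

The individual pieces are extracted from the returned list with the $ operator once we are back in the calling environment.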

As a simple example of a function that returns a value, suppose we want to calculate the ratio of the maximum value of a vector to the minimum value of the vector. Here's a function definition that will do the job:

maxminratio = function(x)max(x)/min(x)

Notice that for a single-line function you don't need to use braces ({}) around the function body, but you are free to do so if you like. Since the value of the final statement wasn't assigned to a variable, it will be used as the return value when the function is called. Alternatively, the value could be placed in a call to the return function. If we wanted to find the max to min ratio for all the columns of a matrix called mymat, we could use our function with the apply function:

apply(mymat,2,maxminratio)

The 2 in the call to apply tells it to operate on the columns of the matrix; a 1 would be used to work on the rows.
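Since mymat is not defined in these notes, here is a small sketch with a made-up matrix (the values are an assumption chosen so the ratios are easy to check by hand):

```r
maxminratio = function(x)max(x)/min(x)

# hypothetical 3 x 2 matrix; the columns are (1,2,4) and (2,6,3)
mymat = matrix(c(1,2,4,2,6,3),ncol=2)

colratios = apply(mymat,2,maxminratio)   # 4/1 and 6/2, i.e. 4 3
rowratios = apply(mymat,1,maxminratio)   # 2/1, 6/2 and 4/3
```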

Before we leave this example, it should be pointed out that this function has a weakness - what if we pass it a vector that has missing values? Since we're calling min and max without the na.rm=TRUE argument, we'll always get a missing value if our input data has any missing values. One way to solve the problem is to just put the na.rm=TRUE argument into the calls to min and max. A better way would be to create a new argument with a default value. That way, we still only have to pass one argument to our function, but we can modify the na.rm= argument if we need to.

maxminratio = function(x,na.rm=TRUE)max(x,na.rm=na.rm)/min(x,na.rm=na.rm)

If you look at the function definitions for functions in R, you'll see that many of them use this method of setting defaults in the argument list.
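A short sketch of the default argument in action, using a made-up vector with one missing value:

```r
maxminratio = function(x,na.rm=TRUE)max(x,na.rm=na.rm)/min(x,na.rm=na.rm)

v = c(2,NA,8)                     # hypothetical data
r1 = maxminratio(v)               # default na.rm=TRUE: 8/2 = 4
r2 = maxminratio(v,na.rm=FALSE)   # missing value propagates: NA
```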

As another example of a function, recall the graph of income versus literacy with different colored points for the different continents. If we were working with datasets like the world1 dataset and wanted to create a variety of plots, we could write a function like this:

worldplotter = function(data,xvar,yvar,cvar,colors,ltitle=cvar,legendloc='topleft'){
   colorvar = factor(data[,cvar])
   with(data,plot(data[,xvar],data[,yvar],col=colors[colorvar],xlab=xvar,ylab=yvar))
   with(data,legend(legendloc,legend=levels(colorvar),col=colors,pch=1,title=ltitle))
}

Now we could produce the income versus literacy graph by calling:

worldplotter(world1,'literacy','income','cont',c('red','blue','green','orange','yellow','violet'))

By changing the arguments, a variety of plots can be produced.

As your functions get longer and more complex, it becomes more difficult to simply type them into an interactive R session. To make it easy to edit functions, R provides the edit command, which will open an editor appropriate to your operating system. When you close the editor, the edit function will return the edited copy of your function, so it's important to remember to assign the return value from edit to the function's name. If you've already defined a function, you can edit it by simply passing it to edit, as in

maxminratio = edit(maxminratio)

You may also want to consider the fix function, which automates the process slightly.

To start from scratch, you can use a call to edit like this:

newfunction = edit(function(){})

2 Sizes of Objects

Before we start looking at character manipulation, this is a good time to review the different functions that give us the size of an object.

length - returns the length of a vector, or the total number of elements in a matrix (number of rows times number of columns). For a data frame, returns the number of columns.
dim - for matrices and data frames, returns a vector of length 2 containing the number of rows and the number of columns. For a vector, returns NULL. The convenience functions nrow and ncol return the individual values that would be returned by dim.
nchar - for a character string, returns the number of characters in the string. Returns a vector of values when applied to a vector of character strings. For a numeric value, nchar returns the number of characters in the printed representation of the number.
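
The following sketch applies these functions to a few small, made-up objects so that the differences are easy to see:

```r
v = c(10,20,30)
m = matrix(1:6,nrow=2)
df = data.frame(a=1:4,b=c('w','x','y','z'))

length(v)     # 3
length(m)     # 6  (2 rows times 3 columns)
length(df)    # 2  (number of columns)

dim(m)        # 2 3
dim(df)       # 4 2
dim(v)        # NULL
nrow(df)      # 4
ncol(df)      # 2

nchar(c('cat','elephant'))   # 3 8
nchar(1234.5)                # 6, the characters in "1234.5"
```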

3 Character Manipulation

While it's quite natural to think of data as being numbers, manipulating character strings is also an important skill when working with data. We've already seen a few simple examples, such as choosing the right format for a character variable that represents a date, or using table to tabulate the occurrences of different character values for a variable. Now we're going to look at some functions in R that let us break apart, rearrange and put together character data.

One of the most important uses of character manipulation is "massaging" data into shape. Many times the data that is available to us, for example on a web page or as output from another program, isn't in a form that a program like R can easily interpret. In cases like that, we'll need to remove the parts that R can't understand, and organize the remaining parts so that R can read them efficiently.

Let's take a look at some of the functions that R offers for working with character variables:

paste The paste function converts its arguments to character before operating on them, so you can pass both numbers and strings to the function. It concatenates the arguments passed to it, to create new strings that are combinations of other strings. paste accepts an unlimited number of unnamed arguments, which will be pasted together, and one or both of the arguments sep= and collapse=. Depending on whether the arguments are scalars or vectors, and which of sep= and collapse= are used, a variety of different tasks can be performed.
1. If you pass a single argument to paste, it will return a character representation:
```
> paste('cat')
[1] "cat"
> paste(14)
[1] "14"
```
2. If you pass more than one scalar argument to paste, it will put them together in a single string, using the sep= argument to separate the pieces:
```
> paste('stat',133,'assignment')
[1] "stat 133 assignment"
```
3. If you pass a vector of character values to paste, and the collapse= argument is not NULL, it pastes together the elements of the vector, using the collapse= argument as a separator:
```
> paste(c('stat',133,'assignment'),collapse=' ')
[1] "stat 133 assignment"
```
4. If you pass more than one argument to paste, and any of those arguments is a vector, paste will return a vector as long as its longest argument, produced by pasting together corresponding pieces of the arguments. (Remember the recycling rule, which will be used if the vector arguments are of different lengths.) Here are a few examples:
```
> paste('x',1:10,sep='')
 [1] "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "x8"  "x9"  "x10"
> paste(c('x','y'),1:10,sep='')
 [1] "x1"  "y2"  "x3"  "y4"  "x5"  "y6"  "x7"  "y8"  "x9"  "y10"
```
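The sep= and collapse= arguments can also be used together. In the sketch below (the variable names are made up), sep= is applied first to join corresponding pieces, and collapse= then joins the resulting vector into a single string:

```r
pieces = paste('x',1:3,sep='_')                   # "x_1" "x_2" "x_3"
formula = paste('x',1:3,sep='_',collapse='+')     # "x_1+x_2+x_3"
```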
grep The grep function searches for patterns in text. The first argument to grep is a text string or regular expression that you're looking for, and the second argument is usually a vector of character values. grep returns the indices of those elements of the vector of character strings that contain the text string. Right now we'll limit ourselves to simple patterns, but later we'll explore the full strength of commands like this with regular expressions.
grep can be used in a number of ways. Suppose we want to see the countries of the world that have the word 'United' in their names.
```
> grep('United',world1$country) 
[1] 144 145
```
grep returns the indices of the observations that have 'United' in their names. If we want to see the values of country that have 'United' in their names, we can use the value=TRUE argument:
```
> grep('United',world1$country,value=TRUE)
[1] "United Arab Emirates" "United Kingdom"
```
Notice that, since the first form of grep returns a vector of indices, we can use it as a subscript to get all the information about the countries that have 'United' in their names:
```
> world1[grep('United',world1$country),]
                 country   gdp income literacy    military cont
144 United Arab Emirates 23200  23818     77.3  1600000000   AS
145       United Kingdom 27700  28938     99.9 42836500000   EU
```
grep has a few optional arguments, some of which we'll look at later. One convenient argument is ignore.case=TRUE, which, as the name implies, will look for the pattern we specified without regard to case.
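Here is a small sketch of ignore.case=TRUE. The vector places is made up for illustration and is not taken from the world1 data:

```r
places = c('United Kingdom','united states','France','Tanzania, United Republic of')

exact = grep('united',places)                      # 2: only the lower-case match
anycase = grep('united',places,ignore.case=TRUE)   # 1 2 4
found = grep('united',places,ignore.case=TRUE,value=TRUE)   # the matching strings
```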
strsplit strsplit takes a character vector, and breaks each element up into pieces, based on the value of the split= argument. This argument can be an ordinary text string, or a regular expression. Since the different elements of the vector may have different numbers of "pieces", the results from strsplit are always returned in a list. Here's a simple example:
```
> mystrings = c('the cat in the hat','green eggs and ham','fox in socks')
> parts = strsplit(mystrings,' ')
> parts 
[[1]]
[1] "the" "cat" "in"  "the" "hat"

[[2]]
[1] "green" "eggs"  "and"   "ham"

[[3]]
[1] "fox"   "in"    "socks"
```
While we haven't dealt much with lists before, one function that can be very useful is sapply; you can use sapply to operate on each element of a list, and it will, if possible, return the result as a vector. So to find the number of words in each of the character strings in mystrings, we could use:
```
> sapply(parts,length)
[1] 5 4 3
```
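sapply also accepts a function that we write ourselves. As a sketch, the following picks out the first and last word of each string (the names firstwords and lastwords are made up):

```r
mystrings = c('the cat in the hat','green eggs and ham','fox in socks')
parts = strsplit(mystrings,' ')

firstwords = sapply(parts,function(p)p[1])           # "the" "green" "fox"
lastwords = sapply(parts,function(p)p[length(p)])    # "hat" "ham" "socks"
```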
substring The substring function allows you to extract portions of a character string. Its first argument is a character string, or vector of character strings, and its second argument is the index (starting with 1) of the beginning of the desired substring. With no third argument, substring returns the string starting at the specified index and continuing to the end of the string; if a third argument is given, it represents the last index of the original string that will be included in the returned substring. Like many functions in R, its true value is that it is fully vectorized: you can extract substrings of a vector of character values in a single call. Here's an example of a simple use of substring
```
> strings = c('elephant','aardvark','chicken','dog','duck','frog')
> substring(strings,1,5)
[1] "eleph" "aardv" "chick" "dog"   "duck"  "frog"
```
Notice that, when a string is too short to fully meet a substringing request, no error or warning is raised, and substring returns as much of the string as is there.
Consider the following example, extracted from a web page. Each element of the character vector data consists of a town name and a state name followed by five numbers. Extracting an individual field, say the field with the state names, is straightforward:
```
> data = c("Lyndhurst      Ohio          199.02  15,074  30  5   25",
           "Southport Town New York      217.69  11,025  24  4   20",
           "Bedford        Massachusetts 221.20  12,658  28  0   28")
> states = substring(data,16,28)
> states
[1] "Ohio         " "New York     " "Massachusetts"
```
It is possible to extract all the fields at once, at the cost of a considerably more complex call to substring:
```
> starts = c(1,16,30,38,46,50,54)
> ends   = c(14,28,35,43,47,50,55)
> ldata = length(data)
> lstarts = length(starts)
> x = substring(data,rep(starts,rep(ldata,lstarts)),rep(ends,rep(ldata,lstarts)))
> matrix(x,ncol=lstarts)
     [,1]             [,2]            [,3]     [,4]     [,5] [,6] [,7]
[1,] "Lyndhurst     " "Ohio         " "199.02" "15,074" "30" "5"  "25"
[2,] "Southport Town" "New York     " "217.69" "11,025" "24" "4"  "20"
[3,] "Bedford       " "Massachusetts" "221.20" "12,658" "28" "0"  "28"
```
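If the nested calls to rep are hard to follow, one alternative (a sketch using mapply, which is a different technique from the single vectorized call above) is to call substring once per field and let mapply assemble the results into a matrix:

```r
data = c("Lyndhurst      Ohio          199.02  15,074  30  5   25",
         "Southport Town New York      217.69  11,025  24  4   20",
         "Bedford        Massachusetts 221.20  12,658  28  0   28")
starts = c(1,16,30,38,46,50,54)
ends   = c(14,28,35,43,47,50,55)

# one call to substring per (start,end) pair; each call returns one value per line of data
fields = mapply(function(s,e)substring(data,s,e),starts,ends)
dim(fields)     # 3 7: one row per line of data, one column per field
fields[,3]      # "199.02" "217.69" "221.20"
```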
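Like many functions in R, substring can appear on the left hand side of an assignment statement, making it easy to change parts of a character string based on the positions they're in. For example, to overwrite the characters starting at the third position of a set of character strings representing numbers with 99, we could use the call below. Note that only as many characters as the replacement string contains are changed (here two, so positions 3 and 4), and nothing is added beyond the end of a short string: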
```
> nums = c('12553','73911','842099','203','10')
> substring(nums,3,5) = '99'
> nums
[1] "12993"  "73991"  "849999" "209"    "10"
```
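The number of characters that actually change depends on both the requested range and the length of the replacement. A small sketch with made-up strings:

```r
s1 = 'abcdef'
substring(s1,3,5) = '99'      # replacement shorter than the range: only 2 characters change
s1                            # "ab99ef"

s2 = 'abcdef'
substring(s2,3,5) = '999'     # replacement fills the whole range
s2                            # "ab999f"

s3 = 'abcdef'
substring(s3,3,4) = '12345'   # replacement longer than the range: extra characters are ignored
s3                            # "ab12ef"
```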
tolower, toupper These functions convert their arguments to all lower-case characters or all upper-case characters, respectively.
sub, gsub These functions replace text that matches a regular expression or text pattern with a different set of characters. They differ in that sub only changes the first occurrence of the specified pattern, while gsub changes all of the occurrences. Since numeric values in R cannot contain dollar signs or commas, one important use of gsub is to create numeric variables from text variables that represent numbers but contain commas or dollar signs. For example, in gathering the data for the world dataset that we've been using, I extracted the information about military spending from http://en.wikipedia.org/wiki/List_of_countries_by_military_expenditures. Here's an excerpt of some of the values from that page:
```
> values = c('370,700,000,000','205,326,700,000','67,490,000,000')
> as.numeric(values)
[1] NA NA NA
Warning message:
NAs introduced by coercion
```
The presence of the commas is preventing R from being able to convert the values into actual numbers. gsub easily solves the problem:
```
> as.numeric(gsub(',','',values))
[1] 370700000000 205326700000  67490000000
```
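The difference between sub and gsub, and the effect of tolower and toupper, can be seen in this short sketch. The string amount is made up, and the pattern '[$,]' is a simple regular expression (a character class matching either a dollar sign or a comma) of the kind we'll study later:

```r
amount = '$1,234,567'

first = sub(',','',amount)                  # "$1234,567": only the first comma is removed
every = gsub(',','',amount)                 # "$1234567":  all commas are removed
num = as.numeric(gsub('[$,]','',amount))    # 1234567

up = toupper('Stat 133')      # "STAT 133"
low = tolower('Stat 133')     # "stat 133"
```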

4 Working with Characters

As you probably noticed when looking at the above functions, they are very simple, and, quite frankly, it's hard to see how they could really do anything complex on their own. In fact, that's just the point of these functions - they can be combined together to do just about anything you would want to do. As an example, consider the task of capitalizing the first character of each word in a string. The toupper function can change the case of all the characters in a string, but we'll need to do something to separate out the characters so we can get the first one. If we call strsplit with an empty string for the splitting character, we'll get back a vector of the individual characters:

> str = 'sherlock holmes'
> letters = strsplit(str,'')
> letters
[[1]]
 [1] "s" "h" "e" "r" "l" "o" "c" "k" " " "h" "o" "l" "m" "e" "s"
> theletters = letters[[1]]

Notice that strsplit always returns a list. This will be very useful later, but for now we'll extract the first element before we try to work with its output.

The places that we'll need to capitalize things are the first position in the vector of letters, and any letter that comes after a blank. We can find those positions very easily:

> wh = c(1,which(theletters == ' ') + 1)
> wh
[1]  1 10

We can change the case of the letters whose indexes are in wh, then use paste to put the string back together.

> theletters[wh] = toupper(theletters[wh])
> paste(theletters,collapse='')
[1] "Sherlock Holmes"

Things have gotten complicated enough that we could probably stand to write a function:

maketitle = function(txt){
  theletters = strsplit(txt,'')[[1]]
  wh = c(1,which(theletters  == ' ') + 1)
  theletters[wh] = toupper(theletters[wh])
  paste(theletters,collapse='')
}

Of course, we should always test our functions:

> maketitle('some crazy title')
[1] "Some Crazy Title"

Now suppose we have a vector of strings:

> titls = c('sherlock holmes','avatar','book of eli','up in the air')

We can always hope that we'll get the right answer if we just use our function:

> maketitle(titls)
[1] "Sherlock Holmes"

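Unfortunately, it didn't work in this case: because maketitle extracts only the first element of the list returned by strsplit (the [[1]]), only the first string was processed. When a function handles just one string at a time, sapply can be used to apply it to every element of a vector: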

> sapply(titls,maketitle)
  sherlock holmes            avatar       book of eli     up in the air 
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air"

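To see why the direct call returned a single string, and to get a result without the original strings attached as names, consider this sketch (USE.NAMES=FALSE is a standard argument to sapply; the variable names are made up):

```r
maketitle = function(txt){
  theletters = strsplit(txt,'')[[1]]
  wh = c(1,which(theletters  == ' ') + 1)
  theletters[wh] = toupper(theletters[wh])
  paste(theletters,collapse='')
}

titls = c('sherlock holmes','avatar','book of eli','up in the air')

npieces = length(strsplit(titls,''))    # 4: one list element per title, but [[1]] keeps only the first
plain = sapply(titls,maketitle,USE.NAMES=FALSE)   # unnamed character vector of the four titles
```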
Of course, this isn't the only way to solve the problem. Rather than break up the string into individual letters, we can break it up into words, and capitalize the first letter of each, then combine them back together. Let's explore that approach:

> str = 'sherlock holmes'
> words = strsplit(str,' ')
> words
[[1]]
[1] "sherlock" "holmes"

Now we can use the assignment form of the substring function to change the first letter of each word to a capital. Note that we have to make sure to actually return the modified string from our call to sapply, so we ensure that the last statement in our function returns the string:

> sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
  sherlock     holmes 
"Sherlock"   "Holmes"

Now we can paste the pieces back together to get our answer:

> res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
> paste(res,collapse=' ')
[1] "Sherlock Holmes"

To operate on a vector of strings, we'll need to incorporate these steps into a function, and then call sapply:

mktitl = function(str){
   words = strsplit(str,' ')
   res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
   paste(res,collapse=' ')
}

We can test the function, making sure to use a string different from the one we used in our initial test:

> mktitl('some silly string')
[1] "Some Silly String"

And now we can test it on the vector of strings:

> titls = c('sherlock holmes','avatar','book of eli','up in the air')
> sapply(titls,mktitl)
  sherlock holmes            avatar       book of eli     up in the air 
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air"

How can we compare the two methods? The R function system.time will report the amount of time any operation in R uses. One important caveat: if you wish to assign the value of an expression to a variable inside the system.time call, you must use the "<-" assignment operator, because an equal sign would be interpreted as specifying a named argument in the function call. Let's try system.time on our two functions:

> system.time(one <- maketitle(titls))
   user  system elapsed 
      0       0       0 
> system.time(two <- mktitl(titls))
   user  system elapsed 
  0.000   0.000   0.001

For such a tiny example, we can't really trust that the difference we see is real. Let's use the movie names from a previous example:

> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+ sep='|',stringsAsFactors=FALSE)
> nms = tolower(movies$name)
> system.time(one <- maketitle(nms))
   user  system elapsed 
  0.000   0.000   0.001 
> system.time(two <- mktitl(nms))
   user  system elapsed 
  0.008   0.000   0.007

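It looks like the first method is faster than the second. Keep in mind, though, that both functions as written process only the first element of their argument, so the calls above time the conversion of a single string; wrapping each call in sapply, as in sapply(nms,maketitle), would time the whole vector. Of course, if the two methods don't get the same answer, it doesn't really matter how fast they are. In R, the all.equal function can be used to see if things are the same: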

> all.equal(one,two)
[1] TRUE
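
Because maketitle and mktitl each use [[1]] and so handle one string per call, one way to time them over an entire vector is to wrap each in sapply. The following is only a sketch: the test vector many is made up, and the timings will vary from machine to machine, so none are shown.

```r
maketitle = function(txt){
  theletters = strsplit(txt,'')[[1]]
  wh = c(1,which(theletters  == ' ') + 1)
  theletters[wh] = toupper(theletters[wh])
  paste(theletters,collapse='')
}

mktitl = function(str){
   words = strsplit(str,' ')
   res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
   paste(res,collapse=' ')
}

titls = c('sherlock holmes','avatar','book of eli','up in the air')
many = rep(titls,500)            # hypothetical test vector of 2000 strings

# note the use of <- rather than = inside system.time
t1 = system.time(one <- sapply(many,maketitle))
t2 = system.time(two <- sapply(many,mktitl))
t1['elapsed']
t2['elapsed']

same = all.equal(one,two)        # TRUE if both methods agree on every string
```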

File translated from TeX by TtH, version 3.67.
On 8 Feb 2010, 13:59.